Rudi Cilibrasi and Paul Vitanyi have demonstrated that it is possible to extract the meaning of 
words from the world-wide web. To achieve this, they rely on the number of webpages containing a 
given word that a Google search returns, and they associate this page count with the probability 
that the word appears on a webpage. Thus, conditional probabilities allow them to correlate one 
word with another word's meaning. Furthermore, they have developed a similarity distance function 
that gauges how closely related a pair of words is. We present a specific counterexample to the 
triangle inequality for this similarity distance function. 



I. INTRODUCTION 



When the Google search engine is used to search for 
word x, Google displays the number of hits that word x 
has. The ratio of this number of hits to the total number 
of webpages indexed by Google represents the probabil- 
ity that word x appears on a webpage. Cilibrasi and 
Vitanyi [CV] use this probability to extract the mean- 
ing of words from the world-wide-web. If word y has a 
higher conditional probability to appear on a webpage, 
given that word x also appears on the webpage, than it 
does by itself, then it can be concluded that words x and 
y are related. Moreover, higher conditional probabilities 
imply a closer relationship between the two words. Thus, 
word x provides some meaning to word y and vice versa. 

Cilibrasi and Vitanyi's normalized Google distance 
(NGD) function measures how close word x is to word 
y on a zero to infinity scale. A distance of zero indicates 
that the two words are practically the same. Two in- 
dependent words have a distance of one. A distance of 
infinity occurs for two words that never appear together. 

Although Cilibrasi and Vitanyi's NGD function was 
sensibly derived from its basic axioms (which allow it to 
theoretically yield the values mentioned above), it does 
not account for the presence of multi-thematic webpages. 
In other words, the NGD function does not account for 
webpages, such as dictionary sites and other long pages, 
which encompass many unrelated subjects. For this rea- 
son, it is necessary to renormalize the NGD formula to 
achieve the desired values. 



II. NGD'S EXPECTATION VALUE 

On average, two random words should be independent 
of one another. Hence, two random words should 
have an NGD of one. To test this assumption, we randomly 
selected two sets of five words from the dictionary. 
In addition, we randomly chose one set of ten words from 
two different news articles (five words were taken from 
each article). We then proceeded to evaluate the NGD 
among the different word pairs in each set using Cilibrasi 
and Vitanyi's formula 



$$
\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}} \tag{1}
$$

where f(x) and f(y) are the number of hits for words x 
and y, respectively, f(x, y) is the number of webpages 
on which both words appear, and M is the total number 
of webpages that Google indexes. 
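
As a minimal sketch, Equation 1 can be evaluated directly from hit counts. The function name `ngd`, the guard clauses, and the hit counts in the example are assumptions made for illustration; they are not measured Google data. The example uses counts for which f(x, y) = f(x) f(y) / M, that is, two independent words, for which the formula returns one.

```python
import math


def ngd(fx: float, fy: float, fxy: float, m: float) -> float:
    """Normalized Google distance (Equation 1) from raw hit counts.

    fx, fy : hits for words x and y
    fxy    : hits for pages containing both words
    m      : total number of indexed pages
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each word needs at least one hit")
    if fxy <= 0:
        # Words that never appear together are infinitely far apart.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))


# Hypothetical counts (not real search results).
M = 1e10
independent = ngd(1e6, 1e4, 1e6 * 1e4 / M, M)  # f(x, y) = f(x) f(y) / M
identical = ngd(5e5, 5e5, 5e5, M)              # every page has both words
never_together = ngd(1e6, 1e4, 0, M)

print(independent, identical, never_together)
```

The three calls correspond to the three reference values quoted in the introduction: one for independent words, zero for words that always occur together, and infinity for words that never do.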

Our first set, which consisted of the words micromete- 
orite, transient, denature, pentameter, and reside, yielded 
an expectation value of 0.64 for the NGD with a stan- 
dard deviation of 0.14. Our second set, which consisted 
of the words detrition, unity, interstice, abrupt, and re- 
side, had an expectation value of 0.75 for the NGD with 
a standard deviation of 0.12. Our last set, which con- 
sisted of the words agency, diabetic, enforcement, federal, 
hormone, illegal, intelligence, measure, spread, and war, 
yielded an expectation value of 0.77 with a standard deviation 
of 0.15. Averaging the above values (weighting 
them according to the number of words in each 
set), we obtain an expected value of 0.7325 for the NGD. 
A similar evaluation provides us with a standard devia- 
tion of 0.14. 
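
The weighted average can be checked with a few lines of arithmetic. The sketch below assumes that the weights are simply the word counts of the three sets (5, 5, and 10) and that the standard deviations are combined with the same weights; under that assumption it reproduces both reported figures. The helper name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Average of `values`, each weighted by the matching entry of `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


words_per_set = [5, 5, 10]       # set sizes reported in the text
mean_ngd = [0.64, 0.75, 0.77]    # expectation value of NGD per set
std_ngd = [0.14, 0.12, 0.15]     # standard deviation per set

combined_mean = weighted_mean(mean_ngd, words_per_set)
combined_std = weighted_mean(std_ngd, words_per_set)

print(round(combined_mean, 4), round(combined_std, 2))
```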

A different type of analysis brought us to a similar 
expectation value. We call this evaluation the triangle 
difference (TD). The reason for this name is that we 
evaluated the following difference: 

$$
\mathrm{TD} = \mathrm{NGD}(x, y) + \mathrm{NGD}(y, z) - \mathrm{NGD}(x, z). \tag{2}
$$

The reason for evaluating this difference is the 
possibility that the sum of two distances, between words 
x and y and between y and z, might be smaller than the 
distance between x and z. If that were the case, it would be 
sensible to redefine the distance between two words as 
the minimum over all possible NGD sums: 

$$
\mathrm{NGD}^*(x, z) = \min\{\mathrm{NGD}(x, y) + \mathrm{NGD}(y, z),\ \mathrm{NGD}(x, z)\}, \tag{3}
$$

for all words y. The triangle inequality (TD >= 0), however, 
is violated only in extremely rare cases, so it is not 
necessary to perform such a minimization. Nevertheless, we 
proceeded to evaluate the expected triangle difference 
for each of our sets. The values were 0.69, 0.79, and 0.60 for 
the first, second, and third sets, respectively. Combined 
with the same weighting, they yield an expected triangle 
difference of 0.67. This, of course, is close to the expected 
value of NGD, as the triangle difference for random words is 

$$
E[\mathrm{TD}] = E[\mathrm{NGD}(x, y)] + E[\mathrm{NGD}(y, z)] - E[\mathrm{NGD}(x, z)]
$$

$$
E[\mathrm{TD}] = 2E[\mathrm{NGD}] - E[\mathrm{NGD}] = E[\mathrm{NGD}]. \tag{4}
$$

Therefore, the expectation value of the triangle difference 
should be equal to the expectation value of the NGD. A 
rough average of our expectation values obtained through 
each method is 0.7. 
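
The triangle difference of Equation 2 and the minimization of Equation 3 can be sketched as follows. The function names, the dictionary layout keyed by unordered word pairs, and the three distances are hypothetical; the distances were chosen so that the detour through the middle word is shorter than the direct distance.

```python
def triangle_difference(d_xy: float, d_yz: float, d_xz: float) -> float:
    """Equation 2: negative when the triangle inequality is violated."""
    return d_xy + d_yz - d_xz


def ngd_star(dist: dict, x: str, z: str) -> float:
    """Equation 3: the direct distance or the shortest detour through any y."""
    words = set().union(*dist)  # every word that appears in some pair
    direct = dist[frozenset((x, z))]
    detours = [
        dist[frozenset((x, y))] + dist[frozenset((y, z))]
        for y in words - {x, z}
    ]
    return min([direct, *detours])


# Hypothetical distances (not measured values).
dist = {
    frozenset(("a", "b")): 0.2,
    frozenset(("b", "c")): 0.5,
    frozenset(("a", "c")): 0.9,
}

td = triangle_difference(
    dist[frozenset(("a", "b"))],
    dist[frozenset(("b", "c"))],
    dist[frozenset(("a", "c"))],
)
print(td < 0)                     # the triple violates the inequality
print(ngd_star(dist, "a", "c"))   # the detour through "b" replaces 0.9
```

This sketch considers only one intermediate word at a time, exactly as Equation 3 is written; it does not chain several detours.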



III. NOTES REGARDING NGD 

In an arduous effort to find a set of words that would 
violate the triangle inequality (that is, yield a negative 
triangle difference), we obtained a set that illustrates 
a few interesting properties of NGD. The set consists 
of the words Rolling Stones, Beatles, and salmonflies 
and it is, among the many word sets that we tried, 
the only one that violates the triangle inequality. Our 
first evaluation of the pertinent distances was the following: 

Word Pair                      NGD 
-----------------------------  ---- 
Rolling Stones, Beatles        0.23 
Beatles, salmonflies           0.81 
Rolling Stones, salmonflies    1.06 

Table 1. First evaluation of NGD values among the words 
Rolling Stones, Beatles, and salmonflies. 

As can be observed, the NGD between Rolling Stones 
and salmonflies (1.06) is slightly higher than the sum 
of the Google distances between Rolling Stones and Beatles 
and between Beatles and salmonflies (0.23 + 0.81 = 1.04). 
Therefore, our first observed property of NGD is that, even 
in the rare cases in which the triangle inequality is violated, 
it is not violated by much. Furthermore, it is important 
to note that our example worked because of the strong 
tendency of people to misspell the word beetle 
as beatle, which decreases the distance between Beatles 
and salmonflies. 
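
The size of the violation can be verified from the three distances of the first evaluation. The variable names below are illustrative; the numbers are the reported ones.

```python
# NGD values from the first evaluation.
d_stones_beatles = 0.23
d_beatles_salmonflies = 0.81
d_stones_salmonflies = 1.06

# Triangle difference with Beatles as the intermediate word.
td = d_stones_beatles + d_beatles_salmonflies - d_stones_salmonflies

print(f"TD = {td:.2f}")                        # small and negative
print("triangle inequality violated:", td < 0)
```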

The second property that we observed, which deserves 
much attention, is the NGD dependence on the Google 
server to which a user connects. A second evaluation of 
the distances in question yielded the following result: 

Word Pair                      NGD 
-----------------------------  ---- 
Rolling Stones, Beatles        0.27 
Beatles, salmonflies           0.82 
Rolling Stones, salmonflies    1.14 

Table 2. Second evaluation of NGD values among the words 
Rolling Stones, Beatles, and salmonflies. 

This second set of distances was obtained by connecting 
to Google through a different internet service provider 
and it shows that Google distances are not stable values. 
In fact, from our example, they can vary by as much as 
17% (for the Rolling Stones-Beatles word pair). 
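
The variation between the two evaluations can be recomputed as shown below. This sketch assumes that the quoted percentage is the change relative to the first evaluation, which is the convention that reproduces roughly 17% for the Rolling Stones-Beatles pair; the variable names are illustrative.

```python
first = {
    ("Rolling Stones", "Beatles"): 0.23,
    ("Beatles", "salmonflies"): 0.81,
    ("Rolling Stones", "salmonflies"): 1.06,
}
second = {
    ("Rolling Stones", "Beatles"): 0.27,
    ("Beatles", "salmonflies"): 0.82,
    ("Rolling Stones", "salmonflies"): 1.14,
}

# Change of each distance relative to its first evaluation (assumed convention).
relative_change = {
    pair: abs(second[pair] - first[pair]) / first[pair] for pair in first
}
for pair, change in relative_change.items():
    print(pair, f"{change:.1%}")

largest = max(relative_change, key=relative_change.get)

# The second evaluation still violates the triangle inequality.
td_second = (
    second[("Rolling Stones", "Beatles")]
    + second[("Beatles", "salmonflies")]
    - second[("Rolling Stones", "salmonflies")]
)
print(largest, td_second < 0)
```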

The last property that this set of words depicts is 
the plausibility of Google distances that are higher than 
unity. With an NGD of 1.14, the word pair Rolling 
Stones-salmonflies has the highest distance that we have 
encountered. Indeed, Google distances greater than one 
are very rare, and before this example, the only one we 
had encountered was between the words transient and 
pentameter (1.02). Moreover, these cases are so rare that 
even Cilibrasi and Vitanyi's conjecture about the words 
by and with having an NGD higher than one is false. 
The actual distance, per the Google server currently in 
use, is 0.19. 

IV. CONCLUSIONS 

Although Google can be used to extract the meaning 
of words, it is important to modify Equation 1 in or- 
der to obtain the desired distance values between words. 
The expectation value of NGD, which is the distance 
between two random and therefore independent words, 
is 0.7. To achieve the desired value of unity between in- 
dependent words, it is only necessary to recalibrate the 
NGD formula by dividing by 0.7: 

$$
\mathrm{NGD}^*(x, y) = \frac{\mathrm{NGD}(x, y)}{0.7}. \tag{5}
$$
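
The recalibration of Equation 5 is a single division. In the sketch below, the function name `recalibrated_ngd` and the constant name are assumptions; the constant is the rough expectation value of 0.7 obtained in Section II, and the sample inputs are the distances from the second evaluation in Section III.

```python
EXPECTED_NGD = 0.7  # rough expectation value of NGD for random word pairs


def recalibrated_ngd(ngd_value: float, expected: float = EXPECTED_NGD) -> float:
    """Equation 5: rescale so that independent words sit at distance one."""
    return ngd_value / expected


print(recalibrated_ngd(0.7))             # an average random pair maps to one
print(round(recalibrated_ngd(0.27), 2))  # Rolling Stones, Beatles
print(round(recalibrated_ngd(1.14), 2))  # Rolling Stones, salmonflies
```

Because the rescaling is linear, it preserves the ordering of word pairs and the sign of every triangle difference; it only moves the reference point for independent words.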

It is also important to remember that NGD values are 
not exact. They depend on the number of hits that each 
word has, which makes them unstable. Factors such as 
the Google server to which one connects and the number 
of websites connected to the world-wide-web can cause 
discrepancies as high as 17%, which our Rolling Stones- 
Beatles example showed. 